I. Introduction


This  survey  is  motivated  by  the  fact  that  the  tremendous  potential 
power  of  microstructure  technology  can  be  realized  only  if  we  find 
effective  parallel  architectures  and  algorithms  for  utilizing  large 
numbers  of  small  but  powerful  processors.  Many  groups,  including  the 
very active NYU-Ultracomputer group, are considering the pragmatic
questions  involved  in  the  choice  of  effective  parallel  architectures. 
Theoretical  studies,  like  that  on  which  this  presentation  focuses,  can 
buttress  this  pragmatic  work  in  two  ways:  by  finding  parallel 
algorithms  that  use  such  machines  effectively,  and  by  defining 
abstract  models  of  parallel  computing  thereby  clarifying  the  "design 
space"  within  which  the  computer  architect  can  make  choices. 
outlined below reflects this point of view. Existing
high-efficiency parallel algorithms are mentioned in the next section.

to use these algorithms to discern the significance of
s between abstract models of parallelism. This point is
in Section 3 where studies regarding limitations of these
models of parallel computation are reviewed. Last, we discuss

investigate the efficiency with which abstract models of
parallelism can be realized concretely. Two lists of published works
Those referenced in the text appear in the first list;
valuable works appear in the second. A special effort was made
recent work and, in particular, papers not given in the
bibliographies of the papers [He-78], [Ku-77] and [Sc-80].


II. Studies of Parallel Algorithms
This abstract model of parallel computation was studied by Shiloach
and Vishkin [SV-81]. It is essentially identical to the SIMDAG of
Goldschlager [Go-82], and is closely related to the P-RAM of Fortune
and Wyllie [FW-78]. Schwartz [Sc-80] calls a similar model "the
Paracomputer". Although Schwartz notes the physical difficulty of
implementing the unbounded fan-in that this abstract model of
computation requires, he does state that such models "can play a
useful role as theoretical yardsticks for measuring the limits of
parallel computation". The reader is referred to Cook [Co-80] for an
extensive survey of models of parallel computation that were suggested
in the literature.
Let Seq(n) be the fastest worst-case running time of a sequential
algorithm for a certain problem of input size n. Obviously, the best
upper bound on the parallel running time achievable, without improving
the sequential result, for an algorithm using p processors in the CRCW
PRAM is of the form O(Seq(n)/p). An algorithm that achieves this
running time can be said to have "optimal speed-up", or more simply to
be "optimal". We are interested in algorithms that are either optimal
or close to optimal.

Upper  bounds  for  a  wide  assortment  of  problems  (required  to  justify 
the  CRCW  PRAM  model  in  the  sense  explained  above)  have  been  reported 
in  the  literature.  Many  such  algorithms  have  been  obtained  in  models 
of  parallel  computation  that  can  be  efficiently  implemented  in  the 
CRCW PRAM model (since they are designed for closely related or more
restrictive  models  of  parallel  computation.)  The  following  remarks 
review  salient  elements  of  this  work. 

Upper bounds on the worst-case resource requirements of algorithms
which are designed for a CRCW PRAM are usually presented as: Depth
O(y) for z processors and m common memory locations (y, z, m may be
functions of the input parameters.) An equivalent formulation of such
a result is: Depth O((y*z)/p) for all p <= z processors and m common
memory locations for the same y, z and m. We use mostly the second
formulation and emphasize algorithms where y*z is not significantly
bigger than the running time of the best sequential algorithm for the
same problem. All the elementary operations required to handle a
problem, including allocation of processors to subtasks, must be taken
into account in evaluating the time complexity of these algorithms.
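
The equivalence of the two formulations can be illustrated by a small
scheduling sketch. It assumes the usual argument that each of p real
processors plays the role of about z/p virtual processors in turn; the
function names are hypothetical, and the sketch deliberately ignores
the cost of allocating processors to subtasks, which the text says
must be counted in a real analysis.

```python
from math import ceil


def simulate_with_fewer_processors(steps, z, p):
    """Schedule `steps` parallel steps of z virtual processors on p <= z real ones.

    Returns a list of (step, batch) pairs, one per unit of real time, where
    batch lists the virtual processors executed in that time unit.
    """
    if not 1 <= p <= z:
        raise ValueError("need 1 <= p <= z")
    schedule = []
    for step in range(steps):
        # One original step is spread over ceil(z / p) rounds.
        for round_no in range(ceil(z / p)):
            first = round_no * p
            last = min(first + p, z)
            schedule.append((step, list(range(first, last))))
    return schedule


def simulated_depth(y, z, p):
    """Real time needed for depth y with z virtual processors on p real ones."""
    return len(simulate_with_fewer_processors(y, z, p))
```

The resulting depth is y * ceil(z/p), which is at most (y*z)/p + y and
hence O((y*z)/p) for p <= z. A faithful simulation would also have to
buffer writes so that all reads of an original step precede its writes.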

Numerous high-efficiency parallel algorithms indicating the robustness
of the CRCW PRAM model can be noted. Shiloach and Vishkin [SV-81]
give techniques for finding the maximum, merging and sorting as
follows.

1. Finding the maximum of n elements in time

O(n/p + log log p) for 1 < p <= n
(optimal for p <= n/log log n)

2. Merging two sorted lists of length m and n (m <= n) in time

O(n/p + log n) for p <= n
(optimal for p <= n/log n), and

O(log m/log(p/n + 1)) for p >= n (= O(k) if p = [(m**(1/k))n]).

Borodin  and  Hopcroft  [BH-82]  give  a  merging  algorithm  requiring 
O((n/p) log log n) time for p <= n. This second merging algorithm and
the  aforementioned  algorithm  for  finding  the  maximum  are  actually 
implementations  of  algorithms  designed  by  Valiant  [Va-75]  for  a 
(loose)  comparison  model  of  parallel  computation. 
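
As an illustration of why concurrent writes help in finding the
maximum, the following sketch simulates sequentially the well-known
constant-depth step that compares all pairs using about n**2 virtual
processors. It is offered as a standard building block only; whether
[SV-81] or [Va-75] use exactly this formulation is an assumption, and
the function name is hypothetical.

```python
def crcw_max(values):
    """Simulate a two-step CRCW maximum with one virtual processor per pair (i, j)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one element")
    # Shared memory: loser[i] becomes True if some element beats values[i].
    loser = [False] * n
    # Parallel step 1: processor (i, j) compares values[i] with values[j].
    # Several processors may write to loser[i] at once, but all write True,
    # so the concurrent writes never disagree.
    for i in range(n):
        for j in range(n):
            beaten = values[i] < values[j]
            tie_lost = values[i] == values[j] and i > j  # lowest index wins ties
            if beaten or tie_lost:
                loser[i] = True
    # Parallel step 2: the single unmarked element reports itself.
    winners = [i for i in range(n) if not loser[i]]
    assert len(winners) == 1
    return values[winners[0]]
```

With only p <= n processors this step cannot be applied to the whole
input at once; the bounds quoted above come from applying such steps to
suitably sized groups.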


3. Sorting n elements in time

O((n/p) log n + log n log p)
for p <= n (optimal for p <= n/log n), and in time

O(((log n)**2)/log(p/n + 1) + log n)
for p >= n (= O(k log n) if p = [n**(1 + 1/k)]).


A time of O(k log n) for p = [n**(1 + 1/k)] processors for this problem
was  also  achieved  by  Hirschberg  [Hi-78]  and  Preparata  [Pr-78]; 
however,  the  Shiloach-Vishkin  algorithm  is  substantially  simpler. 
Using  ideas  of  Preparata,  Borodin  and  Hopcroft  achieved  a  time  of 
O(log  n)  for  p  =  n  log  n  for  the  sorting  problem.  Recently,  Ajtai, 
Komlos   and   Szemeredi   [AKS-83]   presented   a  sorting  network  of  size 
O(n log n) and depth O(log n). This obviously implies an algorithm of
depth O(log n) for n processors. This is an excellent asymptotic
result. It requires, however, a large constant and seems very
complicated. Nath, Maheshwari and Bhatt [NMB-81] show how to use
several parallel sorting algorithms for the parallel computation of
convex hulls in two dimensions.


4. Computing connected components of an undirected graph with n
vertices and m edges.
For this problem, Shiloach and Vishkin [SV-82a] presented an algorithm
of time complexity O(log n) for n + m processors. Chin, Lam and Chen
[CLC-82] and Vishkin [Vi-81] gave, independently, an O((n**2)/p)
algorithm for p <= (n**2)/((log n)**2).

These algorithms for the connected component problem improve those
presented in Wyllie [Wy-79] and Hirschberg, Chandra and Sarwate
[HCS-79], which require a time of

O((log n)**2) using n + m and (n**2)/log n processors,
respectively.

Savage and Ja'Ja' [SJ-81] show how to modify the connectivity
algorithm of [HCS-79] into a minimum-spanning-forest (MSF) algorithm.
This  method  is  used  by  Awerbuch  and  Shiloach  [AS-83]  (resp.  [CLC-82]) 
in  order  to  derive  an  MSF  algorithm  from  a  modification  of  the 
connectivity  algorithm  of  [SV-82a]  (resp.  their  connectivity 
algorithm).  These  MSF  algorithms  use  the  same  time  and  number  of 
processors  as  their  respective  connectivity  algorithms. 
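
Parallel connectivity algorithms are commonly built from two kinds of
steps: "hooking" one tree of vertices onto another along an edge, and
"shortcutting" (pointer jumping) to flatten the trees. The following
is a simplified sequential simulation of that general style, given
only to fix ideas. It is not the algorithm of [SV-82a], it makes no
claim to the O(log n) bound, and the function name and labeling rule
(hook the larger root onto the smaller) are assumptions of this sketch.

```python
def connected_components(n, edges):
    """Label vertices 0..n-1 so that two vertices share a label iff connected."""
    parent = list(range(n))  # every vertex starts as the root of its own tree
    changed = True
    while changed:
        changed = False
        # Hooking: each edge tries to attach the larger of the two current
        # parents onto the smaller one, provided the larger is still a root.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:
                    parent[hi] = lo
                    changed = True
        # Shortcutting: pointer jumping until every vertex points at a root.
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent
```

Because pointers only ever move toward smaller labels, no cycle can
form, and the loop stops once every edge joins two vertices with the
same root.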


5. Computing biconnected components of an undirected graph with n
vertices and m edges.


Tarjan and Vishkin [TV-83] present a new algorithm for this problem
avoiding Depth-First-Search (which seems inherently serial). The
algorithm consists essentially of a reduction of the biconnectivity
problem into the connectivity problem and can be implemented to yield:
(a) Time O(log n) for n+m processors; (b) Time O(n**2/p) for p <=
(n**2)/((log n)**2); as well as (c) Linear serial time. The two last
results were obtained by Tsin and Chin [TC-82]. It seems nontrivial
to derive an efficient biconnectivity algorithm for sparse graphs using
their ideas.

[TV-83]  introduces  a  new  "Eulerian  circuit"  technique  that  provides 
for  a  variety  of  parallel  algorithms  on  trees  in  time  O(log  n).  This 
improves  on  the  known  "centroid  decomposition"  technique  that  yields 
algorithms  on  trees  with  time  estimate  O((log  n)**2).  See  Megiddo 
[Me-81] for an example where the latter technique is used and
discussed. It is an interesting exercise to observe that this
technique is actually the backbone of an earlier paper by Winograd
[Wi-75].
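
One common way to present an Eulerian-circuit construction on a tree
is sketched below: every tree edge is replaced by two opposite arcs,
and the successor of each arc is read off the adjacency lists. Each
successor is computed independently of the others, which is what makes
the construction attractive in parallel. The details of [TV-83] are
not reproduced in this survey, so the formulation and the function
names here are assumptions.

```python
def euler_circuit_successor(adj):
    """Return the successor of every arc in an Eulerian circuit of a tree.

    adj maps each vertex to the list of its neighbours; the lists must be
    symmetric (u in adj[v] iff v in adj[u]) and describe a tree.
    """
    succ = {}
    for v, nbrs in adj.items():
        for idx, u in enumerate(nbrs):
            # Arc (u, v) enters v; it is followed by the arc leaving v
            # toward the neighbour listed cyclically after u.
            w = nbrs[(idx + 1) % len(nbrs)]
            succ[(u, v)] = (v, w)
    return succ


def traverse(succ, start):
    """Follow successor pointers from `start` until the circuit closes."""
    tour = [start]
    arc = succ[start]
    while arc != start:
        tour.append(arc)
        arc = succ[arc]
    return tour
```

For a tree with n vertices the circuit has 2(n-1) arcs. Once the
circuit is available as a linked list, many tree computations reduce
to computations on lists.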


These biconnectivity parallel algorithms improve earlier results
obtained by Savage and Ja'Ja' [SJ-81] and Eckstein [Ec-79b]. The
first paper actually presents two biconnectivity algorithms. One
requires O((log n)**2) time using (n**3)log n processors, and the
other requires O(((log n)**2)log k) time, where k is the number of
biconnected components of the graph, using mn + (n**2)log n processors.
The algorithm of Eckstein requires O((((n+m)/p) + d)(log p + ((log
n)**2))) time, where d is the radius of the graph.

The implementations of the Tarjan-Vishkin (and Tsin-Chin)
biconnectivity algorithm can be utilized to find bridges of a graph in
the same time and number of processors.

6.  Data-structures.   Operating  in  parallel  on  2-3  trees. 

Suppose a 2-3 tree with n leaves is stored in the memory, a1,...,ak
are data that may or may not be stored in the leaves and we are given
k processors P1,...,Pk. Suppose processor Pi knows ai, for each 1 <= i
<= k. Paul, Vishkin and Wagener [PVW-83] show how to search for
a1,...,ak in the tree, how to insert these elements and how to delete
them from the tree in O(log n + log k) time. It is also shown that the
same time is required in order to split the tree with respect to these
elements or to perform the union of k+1 "range-disjoint" 2-3 trees
into a single 2-3 tree.

7. Shiloach and Vishkin [SV-82b] give an algorithm for finding
maximum flow in a network of n vertices within a depth of O(((n**3)log
n)/p) for p <= n.

This parallel max-flow algorithm is very simple conceptually. It
should  be  noted  that  all  known  efficient  one-processor  algorithms  for 
this  problem  are  highly  sequential  in  nature  and  there  is  no  obvious 
way  to  increase  their  speed  by  using  many  processors.  The 
Shiloach-Vishkin  algorithm  for  this  problem  introduces  essentially  new 
graph-theoretical  ideas,  among  other  things  leading  to  a  simple  new 
O(n**3) sequential algorithm. This sequential algorithm, whose
construction  and  time-estimate  both  reflect  notions  of  parallelism, 
shows  that  the  discipline  of  parallel  computation  can  enrich  the  field 
of  sequential  algorithms  as  well. 

On the negative side, Goldschlager, Shaw and Staples [GSS-82] proved
that the max-flow problem is log space complete for P. This implies
that if there is a parallel algorithm for the max-flow problem that
runs in time O((log n)**k) for some k, then such an algorithm exists
for all problems in P - an unlikely possibility.


8. Numerical parallel algorithms. For a comprehensive survey see
Heller [He-78].


9.  Gurevich,  Stockmeyer  and  Vishkin  [GSV-82]  describe  a  general 
sequential  technique  for  solving  certain  NP-hard  graph  problems  in 
time  that  is  exponential  in  a  parameter  k  defined  as  the  maximum,  over 
all biconnected components C of the graph, of the
minimum number of edges that must be added to a tree to produce C.
For  a  connected  graph,  k  is  no  more  than  the  number  of  edges  of  the 
graph  minus  the  number  of  vertices  plus  one.  Coppersmith  and  Vishkin 
[CV-82]  present  an  algorithm  which  finds  a  minimum  vertex  cover  in  a 
graph.  The  algorithm  combines  two  main  approaches  for  coping  with 
NP-completeness, and thereby achieves a better running time than known
algorithms that use only one of these approaches. These two papers
also present parallel implementations which are optimal for a fairly
wide range for the number of processors.


III. Lower Bounds on Parallel Computing Time and Relationships
Between Different Models of Parallel Computation.


Various abstract parallel computation models less powerful than our
CRCW PRAM model have appeared in the literature. Among them we wish
to mention the Concurrent-Read Exclusive-Write PRAM (CREW PRAM) and
the Exclusive-Read Exclusive-Write PRAM (EREW PRAM). The latter does
not allow simultaneous access to the same memory location by more than
one processor while the former allows read conflicts but not write
conflicts.
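
The distinction among the three models can be made concrete by
classifying the memory accesses of a single parallel step. The sketch
below assumes that a step consists of a read phase followed by a write
phase, so a read and a write of the same location in one step do not
count as a conflict; that convention and the function name are
assumptions of this example.

```python
from collections import Counter


def weakest_model_for_step(reads, writes):
    """Return the most restrictive PRAM variant that permits this step.

    reads and writes list the shared-memory addresses accessed by the
    processors in the read phase and in the write phase of one step.
    """
    read_conflict = any(count > 1 for count in Counter(reads).values())
    write_conflict = any(count > 1 for count in Counter(writes).values())
    if write_conflict:
        return "CRCW"  # two processors write the same location
    if read_conflict:
        return "CREW"  # shared reads only, all writes are exclusive
    return "EREW"      # no location is touched twice in either phase
```

For instance, a step in which two processors read address 0 and one
writes address 1 is legal on a CREW PRAM but not on an EREW PRAM.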

The work of Cook and Dwork [CD-82] (resp. Snir [Sn-82]) implies that
a CRCW PRAM (resp. CREW) is more powerful than a CREW (resp. EREW).

Stockmeyer and Vishkin [SV-82c] establish a relation between unbounded
fan-in circuits and the CRCW PRAM described in Section 2. This relation
serves as a link in deriving two results.

(a)  A  lower  bound  for  the  time  required  to  compute  parity.  A  result 
of  Furst,  Saxe,  and  Sipser  [FSS-81]  for  unbounded  fan-in  circuits  is 
used.  It  is  shown  that  it  is  impossible  to  compute  parity  in  constant 
time using a polynomial number of processors in the CRCW PRAM model.

(b) There exists a constant c such that for t > c, CRCW PRAM's operating
in constant time t with a polynomial number of processors have strictly
more power than do those operating in time t-1 with a polynomial
number of processors. This is implied by a recent result of Sipser
[Si-83] for unbounded fan-in circuits.
Chandra, Stockmeyer and Vishkin ([CSV-82a] and [C
reducibilities for combinational logic circuits of po
constant depth containing AND's, OR's and NOT's,
fan-in. Two such reducibilities are defined, a
equivalences among several basic problems such as
integer multiplication, graph connectivity, bipar
network flow are given. Certain problems are shown
with respect to these reducibilities in the fol
classes: deterministic logarithmic space,
logarithmic space, and deterministic polynomial
bounds on the size-depth (unbounded fan-in) circui
symmetric Boolean functions are established. By t
result of Stockmeyer and Vishkin [SV-82] all the theory developed in
this paper is valid for the CRCW PRAM when instead of polynomial size
one considers a polynomial number of processors and the depth of a
circuit is replaced by depth (running time) in the CRCW PRAM.


The following two papers prove interesting lower bounds for a
comparison  model  of  parallel  computation.  Valiant  [Va-75]  gives  a 
lower  bound  for  the  problem  of  finding  the  maximum  of  n  elements. 
Borodin  and  Hopcroft  [BH-82]  present  a  lower  bound  for  the  merging 
problem. 


IV. Implementation of Abstract Parallelism Models


The  paper  Vishkin  [Vi-82]  introduces  an  implementable  general  purpose 
parallel  computer  model  called  the  Parallel  Design  Distributed 
Implementation  (PDDI).  This  is  a  synchronous  distributed  machine  in 
which  each  processor  is  connected  to  at  most  four  others.  This 
machine  model  employs  a  communication  graph  no  more  involved  than  a 
sorting  network  followed  by  a  merging  network. 

A  more  general  abstract  synchronous  parallel  random-access  model  of 
computation,  called  Super  PRAM,  is  introduced  in  the  same  paper.  In 
addition to the capabilities of the CRCW PRAM, this model allows
computation of partial sums and searching in an array in unit time.
efficient representation of this latter model in the PDDI machine
given. More precisely, suppose that for some t and x there is a
parallel algorithm for the Super PRAM that operates in time O(t/p)
p processors for all p <= x. This can be converted into an
algorithm in the PDDI machine that requires time O(t/s) for all s <=
where l depends on the choice of the sorting and merging networks

is the number of "significant" processors used. One possible
configuration results in O(s((log s)**2) + m log m) degenerate
processors and l = ((log s)**2) + log m where m is the size of the
common memory. A second possible configuration, which uses the
(aforementioned) sorting network of [AKS-83] enables us to replace the

(log s)**2 term by log s in the last two formulae. However, the
constants involved are substantially bigger.
Simulation of tightly coupled parallel computation models by a
distributed model of computation is also studied in a few other
papers.

Lev, Pippenger, and Valiant [LPV-81] mention simulations of the EREW
PRAM model. A technique, suggested by Eckstein [Ec-79a], for
eliminating simultaneous access to a single memory location can be
used to translate CRCW PRAM algorithms into algorithms for synchronous
distributed machines. Use of this reduction multiplies time
requirements by O(log m + log p) and processor requirements by O(m),
where p is the number of processors and m the size of the common
memory used by the original CRCW PRAM algorithm. Use of pipelining in
a way similar to [Vi-82] can further improve this solution. The main
disadvantage of this approach is the large number of "auxiliary"
processors required. A similar technique is used in Lev [Lev-80] to
solve another problem and in one of the sorting algorithms in Thompson
[Th-82]. This technique is called by various names in the literature.
We use here the name "Orthogonal Trees". See [Th-82] for more
references to the use of this simple technique and for various names
that were suggested. For the sake of this survey, the contribution of
Awerbuch, Israeli and Shiloach [AIS-83] can be summarized as another
application of this technique. They propose an automatic translation
of the CRCW PRAM into a fixed interconnection pattern, synchronous
distributed machine, such that any CRCW PRAM algorithm A that uses p
processors and time t can be run on that machine using p**2 processors
and O(t log p) time, independently of the size of the shared memory
that A uses. However, for each shared memory location, p local memory
cells of the machine must be kept, which is a big disadvantage.

Borodin and Hopcroft [BH-82] outline another simulation, for the case
p = m. This paper applies sorting in the same way as [Vi-82] (a
similar use of sorting appears earlier also in Vishkin [Vi-83d]).
[Vi-82] improves the time efficiency of [BH-82] for the case p = m.

The comprehensive paper [Sc-80] describes the "Paracomputer", a model
of parallel computation very similar to our CRCW PRAM and proposes the
former as a model suitable for studying theoretical aspects of
parallel computation. Various Paracomputer algorithms are implemented
in the "Ultracomputer" (a perfect shuffle interconnection machine).
The paper Gottlieb et al. [GGKMRS-83] suggests replacing the
CRCW-PRAM-Paracomputer by a Fetch-and-Add-PRAM-Paracomputer and the
Perfect-Shuffle-Ultracomputer by another interconnection network. See
the next section for the differences between the Fetch-and-Add PRAM
and the CRCW PRAM. The automatic procedure for the simulation of the
Paracomputer by the Ultracomputer which is suggested is claimed to
satisfy a good average-case criterion. No claims are made regarding
worst-case criteria that this simulation satisfies.

The  converse  problem  of  simulating  algorithms  given  for  a  synchronous 
distributed machine by CRCW PRAM algorithms is studied in [Vi-82]. It
is  shown  that  this  simulation  can  be  accomplished  within  a  constant 
time  factor  without  using  any  additional  processors.  A  related 
simulation  was  developed  by  Galil  and  Paul  [GP-83].  The  worst  case 
time  requirement  of  their  simulation  is  improved  by  [Vi-82]  for 
comparable  cases.  Moreover,  the  Vishkin  simulation  allows  more 
general   patterns   of   communication   in   one  time  unit  for  the  design 
space and, therefore, equips the designer with more powerful design
tools; examples are given in the paper. Note finally that the
subsection  on  "alignment  networks"  by  Kuck  [Ku-77]  contains  a  survey 
of  known  interconnection  networks  for  processors  and  memories. 

V. More on the Choice of a Parallel Model of Computation

Let A be a common memory address and let e1 be a local register of
processor P1. Following Gottlieb and Kruskal [GK-81], define the
Fetch-and-Add (FandA) instruction as follows. If processor P1
performs a FandA(A,e1) and no other processor performs at the same
time an instruction that relates to address A then the content of A is
transmitted to processor P1 and address A is assigned the value A+e1.
Suppose that several processors perform simultaneously FandA
instructions that relate to A. The result is defined to be as if they
performed these instructions serially in some order. The FandA PRAM
is a CRCW PRAM that allows these FandA instructions.
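
The serialization semantics of simultaneous FandA instructions can be
sketched as follows. The dictionary used for shared memory, the tuple
format of a request and the optional `order` argument are all
assumptions made for this illustration; the definition above only
requires that the outcome agree with some serial order.

```python
def fetch_and_add_step(memory, requests, order=None):
    """Apply simultaneous FandA requests as if executed serially.

    memory   -- dict mapping shared addresses to integer contents
    requests -- list of (processor_id, address, increment) triples
    order    -- indices into requests giving the serial order to emulate
                (defaults to the order in which requests are listed)
    Returns a dict mapping each processor_id to the value it fetched.
    """
    if order is None:
        order = range(len(requests))
    fetched = {}
    for k in order:
        pid, addr, inc = requests[k]
        fetched[pid] = memory[addr]  # the old content is sent to the processor
        memory[addr] += inc          # then the address receives old + increment
    return fetched
```

Whatever order is chosen, the final content of A is its old content
plus the sum of all increments, while the fetched values depend on the
order. When every increment is 1, the processors receive distinct
consecutive values.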

Vishkin [Vi-83a] discusses the general problem of choosing an abstract
model of parallel computation to be simulated by either an EREW PRAM
or a synchronous distributed machine in which each processor is
connected to at most c others, where c is some constant. The
aforementioned principle of choosing the most permissive model of
parallel computation as long as the cost of computational resources
does not increase is applied. Two theorems are proved in the paper.
The first (resp. second) theorem asserts that for every EREW PRAM
(resp. synchronous distributed) machine and every "reasonable"
simulation of the CREW PRAM into this machine, there exists a
simulation of the FandA PRAM into the same machine that uses the same
order of time and sizes of (corresponding) memories. By "reasonable"
we mean that the simulation of the CREW PRAM by this machine has to
satisfy a few assumptions. This implies that if we consider the
choice of an abstract model of parallel computation that allows the
concurrent-read form of simultaneous access to the same memory
location then we may as well choose the more powerful FandA PRAM. It
is interesting to note that the FandA PRAM was chosen as the abstract
model of parallel computation for the NYU-Ultracomputer. See Gottlieb
et al. [GGKMRS-83].

The proofs of these two theorems use a promising methodological notion
for composing a new procedure out of existing ones in serial, parallel
or distributed environments. This notion is called execution
composition. In each of these two proofs we do the following. Say
that a cycle of the FandA PRAM is given. We have to simulate this
cycle by the machine (the EREW PRAM or the synchronous distributed
machine). Instead, a corresponding cycle in the CREW PRAM is
simulated and all its intermediate computations are recorded. These
intermediate computations are then used to form a simulation of the
original cycle of the FandA PRAM into the machine. The theorems do
not specify the simulation of the CREW PRAM into the machine and the
second theorem does not specify the interconnections among processors
in the distributed machine. This modification of a non-specified
simulation by a non-specified machine into another simulation (by the
same machine) demonstrates some of the power of execution
compositions.
In order to proceed we should be a bit more precise. Given a shared
Vishkin and Wigderson [VW-83] present the idea of dynamically changing
locations of addresses among modules throughout the performance of an
algorithm. It is shown how to use this idea in order to solve the
granularity problem in constant time utilizing only as many modules as
the number of processors. For instance, they show that this can be
done for straight-line programs. Such programs seem to be typical
for numerical computations.

Acknowledgement

I  am  grateful  to  A.  Borodin,  R.  Cole,  A.  Gottlieb  and  P.  Spirakis 
for  reading  the  manuscript  and  for  their  helpful  remarks. 